The present invention is related to a pipeline
floating-point processor in which the addition pipelining
is reorganized so that no wait cycle is needed
when the addition uses the result of an immediately
preceding multiplication (fast multiply-add instruction).

The reorganization implies the following
changes to the existing data flow, as shown for the pipeline
floating-point processor of Fig. 4:

1. Data feedback via path ND of normalized data
from the multiplier M into the aligners AL1 and AL2;

2. A shift-left-one-digit feature on both sides of the
data path to take account of a possible leading
zero digit of the product, and special zeroing of
potential guard digits by Z1 and Z2;

3. Exponent generation with 9 bits for overflow and
underflow recognition; on an underflow
the exponent result is reset to zero on the fly by a
true zero unit (T/C).
The invention is related to an arrangement and
a method in a pipeline floating-point processor
(FLPT) of improving the performance of a multiply-add
sequence in which the multiplication is performed
within three cycles: operand read, partial
sums build, and add the partial sums to end result,
and where the addition also needs three cycles:
operand read, operands alignment, and addition.

Floating-point processors (FLPTs) are typically
added as functional units to a main processor (CPU)
for performing scientific applications. In the entry-level
models (e.g. 9221) of the IBM Enterprise
System/9000 (ES/9000), the floating-point processor
is tightly coupled to the CPU and carries out all
IBM System/390 floating-point instructions. All
instructions are hardware-coded, so no microinstructions
are needed. Moreover, binary integer
multiplication is also implemented on the floating-point
unit to improve overall performance.

Fig. 1 shows the data flow of the above-mentioned
floating-point processor, which is described
in more detail in the IBM Journal of Research and
Development, Vol. 36, Number 4, July 1992. While
the CPU is based on a four-stage pipeline, the
floating-point processor requires a five-stage pipeline
to perform its most used instructions, e.g. add,
subtract, and multiply, in one cycle for double-precision
operands (reference should be made to
"ESA/390 Architecture", IBM Form No.: G580-1017-00
for more detail).

The CPU resolves operand addresses, provides
operands from the cache, and handles all
exceptions for the floating-point processor. The five
stages of the pipeline are instruction fetch, which is
executed on the CPU, register fetch, operand realignment,
addition, and normalization and register
store.

To preserve synchronization with the CPU, a
floating-point wait signal is raised whenever a floating-point
instruction needs more than one cycle.
The CPU then waits until this wait signal disappears
before it increments its program counter and
starts the next sequential instruction, which is kept
on the bus.

Because the IBM System/390 architecture requires
that interrupts be precise, a wait condition is
also invoked whenever an exception may occur. As
can further be seen from Fig. 1, many bypass busses
are used to avoid wait cycles when the results
of the foregoing instructions are used. A wait cycle
is needed only if the result of one instruction is
used immediately by the next sequential instruction
(NSI), e.g. when an add instruction follows a multiply
instruction, the result of which has to be
augmented by the addend of the add instruction.

The data flow shown in Fig. 1 has two parallel
paths for fraction processing: one add path where
all non-multiply/divide instructions are implemented,
and one multiply path specially designed for
multiply and divide. The add path has a fixed (60)
bit width and consists of an operand switcher, an
aligner, an adder, and a normalizer shifter. Instead
of using two aligners, one on each side of the operand
paths, a switcher is used to switch operands, thereby
saving one aligner. The switcher is also needed
for other instructions and requires much less
circuitry.

The multiplier path consists of a Booth encoder
for the 58-bit multiplier, a multiplier macro which 
forms the 58x60-bit product terms sum and carry, 
and a 92-bit adder which delivers the result prod- 
uct. The sign and exponent paths are adjusted to 
be consistent with the add path. The exponent path 
resolves all exception and true zero situations, as 
defined by the earlier cited IBM System/390 ar- 
chitecture. 

The implementation of all other instructions is
merged into the add path and multiply path, and
requires only minimal additional logic circuits. The
data flow in Fig. 1 therefore shows more function
blocks and multiplexer stages than needed for only
add, subtract, and multiply operations.

As can further be seen from Fig. 1, the data
flow is partitioned into smaller parts FA, FB, FC,
FD, MA, MB, PS, PC, and PL (typically registers
with their input control). The floating-point
instructions are likewise partitioned into
three main groups:

1.) addition/subtraction, load;

2.) multiplication; and

3.) division.

These are the instructions most used in scienti- 
fic applications. The first two groups of instructions 
are performed in one cycle, and division is made 
as fast as possible. 

For an add instruction, during the first two
pipeline stages, only instruction and operand fetching
are done. All data processing is concentrated in
the third and fourth pipeline stages. In the fifth
stage, the result is written back to a floating-point
register.

Loading operations are treated like addition,
with one operand equal to zero. During stage 3 the
exponents of both operands are compared in order
to determine the amount of alignment shift. The
operand with the smaller exponent is then passed
to the aligner for realignment. In stage 4 of the
pipeline the aligned operands are added. The addition
may produce a carry-out, which results in a
shift right by one digit position, in accordance with
the said architecture. The exponent is then
increased accordingly.
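
The stage 3 alignment and stage 4 addition described above can be illustrated with a small model. This is only a sketch under stated assumptions: both operands are positive, the fraction is shortened to a hypothetical 6 hexadecimal digits (`N`), one guard digit is kept during alignment, and the names `align` and `hex_add` are illustrative rather than taken from the processor.

```python
N = 6  # hypothetical fraction width in hex digits (chosen small for readability)

def align(frac, shift):
    """Right-shift an N-digit fraction by `shift` hex digits.

    The result has N+1 digits: the lowest one is the guard digit that
    catches the first digit shifted out; further digits are lost.
    """
    wide = frac << 4                 # append an empty guard digit position
    return wide >> (4 * shift)

def hex_add(f1, e1, f2, e2):
    """Add two positive hexadecimal floating-point operands (fraction, exponent)."""
    if e1 < e2:                      # make operand 1 the one with the larger exponent
        f1, e1, f2, e2 = f2, e2, f1, e1
    a = f1 << 4                      # larger operand, guard digit = 0
    b = align(f2, e1 - e2)           # smaller operand, shifted by the exponent difference
    total = a + b
    exp = e1
    if total >> (4 * (N + 1)):       # carry-out of the leading digit
        total >>= 4                  # shift right by one digit position ...
        exp += 1                     # ... and adjust the exponent by one
    return total >> 4, exp           # drop the guard digit (truncation)

# Exponents 05 and 07: the smaller operand is shifted right by two digits.
print([hex(x) for x in hex_add(0x123476, 5, 0x400000, 7)])
# Carry-out case: 0.9 + 0.8 (hex fractions) needs a right shift by one digit.
print([hex(x) for x in hex_add(0x900000, 7, 0x800000, 7)])
```

Post-normalization and exponent exception handling are deliberately left out of this sketch.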

Since time is still available in stage 4, the
exponent calculation is made sequentially after that
of addition, using only one exponent adder with an
input multiplexer (Fig. 1) to select whether an exponent
increase, an exponent adjustment, or a multiply/divide
exponent is required.

Leading-zero detection is made by calculating
the hexadecimal digit sums without a propagated
carry-in. Hexadecimal sums 0 and F for the digit
position i are determined and fed into a multiplexer.
The carry-in to this digit position selects whether or
not the result digit is zero. This carry bit comes
from the same carry-look-ahead circuit used for the
adder, so no additional circuit is needed. By using
the above described logic, the shift amount can be
determined at nearly the same time as the addition
result.
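
The zero-digit prediction described above can be checked with a short model. It is a sketch only: the helper names are hypothetical, the carry-in is computed arithmetically here instead of by a carry-look-ahead circuit, and a carry out of the leading digit is ignored. A result digit is zero exactly when the carry-free digit sum is 0 with carry-in 0, or F with carry-in 1.

```python
def digits_of(value, n):
    """Split `value` into n hex digits, most significant first."""
    return [(value >> (4 * (n - 1 - i))) & 0xF for i in range(n)]

def zero_digit_flags(a, b, n):
    """Predict per digit position whether the digit of a + b is zero."""
    da, db = digits_of(a, n), digits_of(b, n)
    flags = []
    for i in range(n):
        local = (da[i] + db[i]) & 0xF            # digit sum without propagated carry-in
        low_bits = 4 * (n - 1 - i)               # width of all digits below position i
        mask = (1 << low_bits) - 1
        carry_in = ((a & mask) + (b & mask)) >> low_bits   # 0 or 1
        flags.append((local == 0x0 and carry_in == 0) or
                     (local == 0xF and carry_in == 1))
    return flags

def leading_zero_digits(a, b, n):
    """Number of leading zero digits of a + b, i.e. the normalization shift amount."""
    count = 0
    for is_zero in zero_digit_flags(a, b, n):
        if not is_zero:
            break
        count += 1
    return count

print(zero_digit_flags(0x0012, 0x00EE, 4))     # sum is 0x0100
print(leading_zero_digits(0x0012, 0x00EE, 4))
```

Because the flags depend only on the operand digits and the carries, they can be formed in parallel with the sum itself, which is the point made in the paragraph above.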

Exponent exception, either overflow or underflow,
is also detected in stage 4. Meanwhile, the
next instruction has already been started. As mentioned
earlier, a wait may be raised at stage 3 to hold
execution of the next sequential instruction. In the
case of an effective addition, the wait situation is
met when

- the intermediate result exponent is 7F (hex)
and will overflow when an exponent increment
is caused by a carry-out from the adder;

- the intermediate result exponent is smaller
than 0D, and a normalization is required for
unnormalized operands.
Here the exponent must be decreased by the
normalization shift amount, which can be at most
0D (decimal 13), thus producing an exponent underflow.
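
The two wait conditions for an effective addition can be written as a simple predicate. This is a sketch, assuming the thresholds 7F and 0D quoted in the text are hexadecimal exponent values; the function and constant names are hypothetical.

```python
EXP_MAX = 0x7F      # exponent value at which a carry-out increment would overflow
NORM_LIMIT = 0x0D   # below this value a normalization shift may underflow

def add_needs_wait(intermediate_exp, has_unnormalized_operand):
    """Return True if the next sequential instruction must be held at stage 3."""
    overflow_risk = intermediate_exp == EXP_MAX
    underflow_risk = intermediate_exp < NORM_LIMIT and has_unnormalized_operand
    return overflow_risk or underflow_risk

print(add_needs_wait(0x7F, False))   # possible overflow on carry-out
print(add_needs_wait(0x05, True))    # possible underflow on normalization
print(add_needs_wait(0x40, True))    # no exception possible, no wait
```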

Multiplication is implemented by using a modified
Booth algorithm multiplier with serial addition
of partial product terms. In most high-performance
mathematical co-processors it is performed
within three instruction cycles:

1. Operand read,

2. Partial sums build, and

3. Add partial sums to the end result. (Reference
should be made to Fig. 2b)
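
For readers unfamiliar with modified Booth multiplication, the recoding idea can be sketched as follows. The source does not state the radix of its encoder; this sketch assumes the common radix-4 form, in which the multiplier is recoded into digits from -2 to +2 so that only about half as many partial products have to be added. All names are illustrative.

```python
def booth_radix4_digits(multiplier, bits):
    """Recode an unsigned `bits`-bit multiplier into radix-4 digits in -2..2.

    Digits are returned least significant first. The multiplier is padded
    with zero bits so that the last group absorbs the final carry.
    """
    n = bits + 2 if bits % 2 == 0 else bits + 1
    digits = []
    prev = 0                                   # bit to the right of the current pair
    for i in range(0, n, 2):
        b0 = (multiplier >> i) & 1
        b1 = (multiplier >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)      # standard radix-4 Booth digit
        prev = b1
    return digits

def booth_multiply(multiplicand, multiplier, bits):
    """Sum the partial products selected by the Booth digits."""
    total = 0
    for i, d in enumerate(booth_radix4_digits(multiplier, bits)):
        total += (d * multiplicand) << (2 * i)  # each digit has weight 4**i
    return total

print(booth_radix4_digits(0xFF, 8))
print(booth_multiply(0x1234, 0xFF, 8) == 0x1234 * 0xFF)
```

In hardware the partial products would be reduced to a sum term and a carry term before a final addition, which corresponds to the "partial sums build" and "add partial sums" cycles listed above.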

Data bypass in the first and third cycles allows
a saving of one cycle when using the same result
for a following instruction. However, one wait cycle
is still needed, as can be seen from Fig. 3, where a
multiply instruction is immediately followed by an
add instruction which uses as addend or augend
the result of the preceding multiplication.

In solving mathematical problems, especially in
matrix calculations, the sequence multiply-add,
where the add operation uses the result of the
multiplication, is used very often.

RISC (reduced instruction set computer) systems,
such as IBM's RS/6000, have a basic
design which allows the combination of both operations
in a single complex. However, this design does
not conform to the ESA/390 architecture cited
earlier. Old programs may deliver results different
from those in ESA/390 mode. To avoid this, a single wait
cycle has to be inserted (Fig. 3).

In performance calculations the LINPACK loop
is used very often, which consists of a sequence of
five instructions:

1.) Load;

2.) Multiply;

3.) Add;

4.) Store; and

5.) Branch back.

The branch instruction is normally processed in
zero cycles, so that the additional wait cycle would
contribute to a performance degradation of 25%.
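
The 25% figure follows from simple cycle counting, which the sketch below reproduces. It assumes that load, multiply, add and store each occupy one cycle in steady-state pipelined execution and that the branch costs zero cycles, as stated above; the dictionary and function names are illustrative.

```python
def loop_cycles(instruction_cycles, wait_cycles=0):
    """Total cycles for one pass through the loop."""
    return sum(instruction_cycles.values()) + wait_cycles

# One iteration of the five-instruction loop (branch handled in zero cycles).
loop = {"load": 1, "multiply": 1, "add": 1, "store": 1, "branch": 0}

base = loop_cycles(loop)                      # no wait cycle
with_wait = loop_cycles(loop, wait_cycles=1)  # one wait between multiply and add
degradation = (with_wait - base) / base

print(base, with_wait, f"{degradation:.0%}")
```

One extra cycle on top of four gives the quoted degradation of one quarter.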

So, it is the object of this invention to increase
the performance of pipeline floating-point processors,
mainly when matrix calculations have to be
performed, with their high quantity of multiply-add
sequences using the result of the immediately preceding
multiplication.

This object of the invention is accomplished for
an arrangement by the features of claim 1 and for
a method by the features of claim 2.

By applying the above features to a pipeline
floating-point processor, the advantage of a 25%
performance increase for multiply-add instructions
will be achieved.

A full understanding of the invention will be
obtained from the detailed description of the preferred
embodiment presented herein below, and
the accompanying drawings, which are given by
way of example, wherein:

Fig. 1 illustrates a block diagram of a prior art
pipeline floating-point processor;

Figs. 2a, 2b, 3 show a schematic representation of
various stages of a pipeline handling an add instruction,
a multiply instruction, and a multiply-add instruction
sequence in a pipeline floating-point processor of Fig. 1;

Fig. 4 illustrates a block diagram of a pipeline
floating-point processor, modified in accordance
with the invention;

Fig. 5 shows a schematic representation of the
pipeline stages of a floating-point processor of
Fig. 4, handling a multiply-add instruction sequence;

Figs. 6 - 11 depict various examples of conventional
add and multiply operations in a pipeline
floating-point processor of Fig. 4; and

Figs. 12 - 15 depict various examples of the new
multiply-add instruction in a pipeline floating-point
processor of Fig. 4.
The new data flow of a pipeline floating-point
processor shown in Fig. 4 allows zero-wait processing
of the multiply-add instruction sequence,
as can be seen from Fig. 5, which is obtained by
essentially four modifications:

1. Data feedback of normalized data from the 
multiplier M into the aligners AL1 and AL2 via 
feedback path ND; 

2. Shift left one digit by SL1 and SL2 on both
sides of the data path to take account of a
possible leading zero digit of the product (special
zeroing of guard digits);

3. Exponent generation by 9 bits for overflow
and underflow recognition in Z1 and Z2. On
underflow the exponent result is reset to zero on
the fly by true zero; and

4. Both aligners AL1 and AL2 are expanded to 
16 digits. 

For performing the fast multiply-add instruction 
sequence the following procedural steps are neces- 
sary (please refer to Fig. 5): 

1. Read the operands OPDI and OPDII for per- 
forming a multiplication; 

2. Calculate the intermediate exponent product 
and build the partial sums for multiplication in 
the multiply array M. At the same time read the 
operand OPD1 for the addition; 

3. Add the partial sums of the multiply array to 
build the end product and feed the data back for 
the addition. In parallel a comparison of expo- 
nents is performed for an alignment in a 16-digit 
frame. An end alignment is then adjusted by 
one left shift if the leading digit of the product is 
zero. However, the following cases have to be 
envisaged: 

a) The product is true zero, so the operand 
coming from the multiplier array M is forced 
to zero; 

b) the intermediate product exponent is 
smaller than the OPD1 exponent, then the 
product is aligned and no further special ac- 
tions have to be taken; and 

c) the intermediate exponent is greater than 
the OPD1 exponent, then the addend has to 
be aligned; 

If the product does not have a leading zero, then
the guard digit of the product has to be set to
zero. But if the product has a leading zero, then
both operands (the result operand from the multiplication
and OPD1) are shifted left by one
digit, and the 16th digit (in the example of the
data flow of Fig. 5) of the aligner becomes the
guard digit of the result.

4. When both operands are properly aligned, 
they will be added to the final result of the 
multiply-add instruction sequence without any 
need for a wait cycle (as can be seen from a 
comparison of Fig. 3 and Fig.5). 
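
The case analysis of step 3 can be summarized as a small decision function. This is a sketch rather than a model of the actual control logic: it only reports which actions the text prescribes, it applies the leading-zero rule to every non-zero product (one possible reading of the text), and the names `AlignPlan` and `plan_alignment` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AlignPlan:
    force_product_zero: bool   # case a): product is true zero
    shifted_operand: str       # "product", "addend" or "none"
    right_shift_digits: int    # alignment shift within the 16-digit frame
    left_shift_both: bool      # SL1/SL2 shift both operands left by one digit
    zero_product_guard: bool   # guard digit of the product is set to zero

def plan_alignment(prod_exp, add_exp, product_is_true_zero, product_has_leading_zero):
    """Select the alignment actions for the addition of a multiply-add sequence."""
    if product_is_true_zero:                       # case a)
        return AlignPlan(True, "none", 0, False, False)
    diff = prod_exp - add_exp
    if diff < 0:                                   # case b): align the product
        shifted, amount = "product", -diff
    elif diff > 0:                                 # case c): align the addend
        shifted, amount = "addend", diff
    else:                                          # equal exponents: nothing to shift
        shifted, amount = "none", 0
    return AlignPlan(
        force_product_zero=False,
        shifted_operand=shifted,
        right_shift_digits=amount,
        left_shift_both=product_has_leading_zero,
        zero_product_guard=not product_has_leading_zero,
    )

print(plan_alignment(5, 7, False, False))
print(plan_alignment(5, 2, False, True))
```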



The examples EX.1 - EX.10 (Figs. 6 - 15) described
below show, for addition, multiplication, and
multiplication with an immediately following addition
in which one operand is the result of the preceding
multiplication, various conditions under which the
results have to be calculated, and how the data flow
of the floating-point processor handles these situations
in the different pipeline stages in accordance with
Figs. 2a, 2b, 3 and 5.

In a first group of examples, EX.1 - EX.6
(Figs. 6 - 11), various conditions are shown
which may occur during conventional add and multiply
operations in the new floating-point processor
data flow in accordance with Fig. 4.


EX.1 (Fig. 6) 

The operands OPD1 and OPD2 (augend and
addend) are transferred to the intermediate adder
input registers FA and FB during 'operand read'.
The operands consist of a fraction value and an
exponent. As the exponents 05 and 07 do not
match, a right shift of the fraction with the lower
exponent by two digit positions is necessary for
operand alignment. This is done during 'operand
alignment'. The underflow value 7 of operand OPD1
is caught by a guard digit GD for later use when the
result (intermediate or final) of the addition is built
during 'addition'. After alignment by the aligners
AL1 and AL2, and after having passed through the
shifters SL1 and SL2 and the true/complement unit
T/C, which is interconnected between SL1 and zero
detector Z1, the operands are stored in the input
registers FC and FD of the adder ADD-A. ADD-A
generates the intermediate result IR1 shown in
example EX.1.

A normalization of the fraction part of the result
has to be performed by normalizer NORM-A, which
results in a truncated normalized fraction, and the
exponents are adjusted. The final result/sum is then
stored in output register FE. All of the above operations
are done in the 'addition' pipeline stage of
the floating-point processor.

EX.2 (Fig. 7)

In example EX.2 a further conventional addition
is shown where operand OPD2 is smaller than
operand OPD1. Therefore, the fraction of operand
OPD1 has to be shifted to the right by the difference
(4) of the exponents (05, 01) for operand
alignment. Again, the guard digit catches the underflow
(1) of the shift operation. As the intermediate
result IR1 of the addition has three leading
zeros, a left shift by 3 is necessary, resulting in an
exponent 02 of the final result FR, stored in adder
output register FE.
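
The post-normalization step of EX.2 (three leading zeros, left shift by 3, exponent 05 reduced to 02) can be sketched as follows. The fraction value used here is hypothetical, since the actual operand values appear only in Fig. 7; a 6-digit fraction plus one guard digit is assumed, and exponent underflow is not checked.

```python
def normalize(wide, exp, n):
    """Normalize a fraction held as n digits plus one guard digit.

    Shifts left one hex digit at a time until the leading digit is
    non-zero, decrementing the exponent for each shift, then drops the
    lowest digit. Returns (fraction, exponent, shift_amount).
    """
    if wide == 0:
        return 0, 0, 0                      # true zero: nothing to normalize
    top = 4 * n                             # bit offset of the leading digit
    shift = 0
    while (wide >> top) & 0xF == 0:
        wide <<= 4                          # the guard digit moves into the fraction
        shift += 1
    return wide >> 4, exp - shift, shift    # truncate back to n digits

# Intermediate result 0.000ABC with guard digit 1 and exponent 05.
frac, exp, shift = normalize(0x000ABC1, 5, 6)
print(hex(frac), exp, shift)
```

The example shows why the guard digit is kept: after the left shift it becomes a significant digit of the final fraction.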
EX.3 (Fig. 8)

In example EX.3, a multiplication is depicted in
which the operands OPDI and OPDII have been
read into the multiplier input registers MA and MB.
As in the previous examples the operands consist
of a fraction and an exponent. The partial sums are
built in the multiplier array M and intermediately
stored in the multiplier output registers PC and PS.
In the example the actual values are omitted for
convenience. The partial product addition
leads to an intermediate result IR2 in which, however,
the fraction part has a leading zero. This
causes a left shift by 1 and an exponent adjustment
05 -> 04. So, output register FE now contains
the truncated, normalized fraction as well as the
adjusted exponent.

EX.4 (Fig. 9)

In this example no shift operations are
necessary after the product addition. Only a truncation
is required to bring the final result in output
register FE to the standard number of digit positions.

EX.5 (Fig. 10)

In example EX.5, operands are shown having
negative exponents (-49, -50), with the fraction of OPDI
larger than that of OPDII. The fraction values do not
result in an overflow. As the example
shows, a left shift by 1 of the intermediate
result IR2 is necessary only for an exponent
adjustment for the final result FR. However, an
exponent underflow took place, so that FR is true
zero.
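
The exponent handling behind EX.5 can be sketched with the excess-64 characteristic of the System/390 hexadecimal format. The sketch assumes that the exponents -49 and -50 are true (unbiased) exponents, ignores interruption handling, and uses hypothetical function names. It also shows why more than 8 bits are needed: the intermediate value can be negative as well as larger than 127.

```python
BIAS = 64                      # excess-64 characteristic (7 bits, 0..127)
CHAR_MIN, CHAR_MAX = 0, 127

def fits_signed_9_bits(value):
    """True if `value` is representable in 9-bit two's complement."""
    return -256 <= value <= 255

def product_characteristic(c1, c2, leading_zero):
    """Characteristic of a product, with overflow/underflow recognition.

    c1, c2 are the biased operand exponents; `leading_zero` says whether the
    product fraction needs a one-digit left shift. On underflow the exponent
    is reset to zero, as for a true zero result.
    """
    raw = c1 + c2 - BIAS
    if leading_zero:
        raw -= 1
    if raw > CHAR_MAX:
        return "overflow", raw
    if raw < CHAR_MIN:
        return "underflow", 0
    return "ok", raw

# True exponents -49 and -50 correspond to characteristics 15 and 14.
print(product_characteristic(BIAS - 49, BIAS - 50, leading_zero=True))
# Extreme cases stay within a 9-bit signed intermediate value.
print(fits_signed_9_bits(0 + 0 - BIAS - 1), fits_signed_9_bits(127 + 127 - BIAS))
```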

EX.6 (Fig. 11)

EX.6 shows a very simple example with negative
operands where only a truncation of IR2 is
necessary for forming the final result FR in register
FE.

The following second group of examples EX.7 -
EX.10 (Figs. 12 - 15) shows the zero-wait processing
of the multiply-add instruction and the various
procedural steps performed in the different pipeline
stages Stage 1 - Stage 4.

EX.7 (Fig. 12) 

As can be seen, the multiplication requires
three phases A1 - A3, the same number as the phases
B1 - B3 which are necessary for an addition. The
whole operation is therefore performed within
four pipeline stages Stage 1 - Stage 4.
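
The overlap of the multiply phases A1 - A3 with the add phases B1 - B3 can be illustrated with a tiny scheduling model. It is a sketch under assumptions: the add starts one stage after the multiply, its alignment phase B2 needs the product, and without the feedback path the product is assumed to become usable only in the stage after A3 (the wait cycle). The function name is hypothetical.

```python
def schedule(feedback_path):
    """Return the pipeline stage of each phase for a multiply followed by
    an add that uses the product."""
    mul = {"A1": 1, "A2": 2, "A3": 3}
    b1 = 2                                  # operand read of the add overlaps A2
    # With the feedback path the product enters the aligners during A3;
    # otherwise the add has to wait one stage for it.
    product_ready = mul["A3"] if feedback_path else mul["A3"] + 1
    b2 = max(b1 + 1, product_ready)
    add = {"B1": b1, "B2": b2, "B3": b2 + 1}
    return {**mul, **add}

print(schedule(feedback_path=True))    # finishes in stage 4
print(schedule(feedback_path=False))   # one wait cycle, finishes in stage 5
```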

During phase A1 both operands OPDI and OPDII
are read into the input registers MA and MB of
the multiplier array M (pipeline stage 1).

In the next pipeline stage 2, phase A2, the
partial sums of the multiplication are built and
transferred subsequently into the multiplier output
registers PC and PS for being added later. In the
same pipeline stage 2, but in phase B1, operand
OPD1 is read into an intermediate input register FA
for adder ADD-A. The old contents of the other
intermediate adder input register FB, left
there from a previous normal add instruction, are
of no interest in this case, because the second
operand (OPD2) for the addition of a multiply-add
instruction is generated in the next pipeline
stage 3, phase A3, by adding up the partial sums
in adder ADD-M. This gives the intermediate result
IR1, which is fed back via the feedback path ND
previously explained in connection with Fig. 4, now
forming operand OPD2.

As is shown in Fig. 4, the operands on their
way to adder input registers FC and FD have, if
necessary, to undergo alignment operations in aligners
AL1, AL2 and shifters SL1, SL2 when the
exponents do not match, or zeroing operations in
T/C, Z1, Z2 when leading zeros, guard digits GD
included, have to be removed before the actual
addition in adder ADD-A.

Some special situations are shown in EX.7. As
shown under (1), IR1 is fed back via path ND with
one extra digit GD (4 bits) in pipeline stage 3. The
guard digit (GD = 8) resulted from the product addition
which took place in stage 3, phase A3.

Under (2) it is shown that the alignment of
operand OPD1, caused by a right shift by 3 positions,
resulted in a data width extended by two
GDs (1,1).

At the position of reference mark (3) it is
shown that the operand transferred from IR1 to
register FD has a leading zero which has to be
removed by a left shift, so that the resulting exponent
Exp is changed from 05 -> 04.

During stage 3, phase A3, the contents
of FE are furthermore truncated and normalized. This causes an
exponent adjustment of -1 (05 -> 04). For forming
the final result in stage 4, phase B3, a
further left shift is therefore necessary for exponent
adjustment prior to the final addition operation. The result
of the addition, intermediately stored in IR2, however,
still has to be truncated and normalized.
During this procedure a guard digit, if present, has
to be removed and the final result has to be
transferred to FE, the output register containing the
final result FR.

EX.8 (Fig. 13)

In example EX.8 special situations caused by
operand values different from those discussed in
EX.7 are indicated by two reference marks.

At the first mark, an independent zeroing of the guard digit
GD in FD is required, which is done in stage 3,
phase B3.

At the second mark, it is necessary to truncate the fraction
part of the operand in FC. This means that no left
shift by one digit has to be made, so that only the
first n + 1 digits enter the addition.

EX.9 (Fig. 14) 

In example EX.9 a special situation is marked
where, in stage 3, phase B2, the
operand intermediately stored in FD (the result of
the multiplication) requires an additional guard digit
GD. As the exponents of both operands are already
adjusted (both are 05), no subsequent shift
operation is required.

EX.10 (Fig. 15) 

In the final example EX.10 it is shown at the
reference mark how an exponent underflow is handled. An
exponent underflow requires one more bit (q) and a
cancellation of data feedback via path ND if a true
zero situation was detected by the T/C unit.

Claims 

1. Arrangement in a floating point processor com- 
prising 

- a multiply section (MS) having a first 
input register (MA) and a second input 
register (MB) for intermediately storing 
the operands (OPDI, OPDII) prior to a 
multiplication in a multiplier (M) the out- 
put of which is connected to adder out- 
put registers (PC, PS) for intermediately 
storing the partial sums of the multiplica- 
tion prior to their addition in a first adder 
(ADD-M), and a first normalizer (NORM- 
M) connected to the adder output for 
normalizing the sum (OPD2) of the partial 
sums; and 

- an add section (AS) having a third (FA) 
and a fourth input register (FB) for inter- 
mediately storing the operands (OPD1, 
OPD2) for addition, a first (FC) and a 
second adder input register (FD) for in- 
termediately storing the operands prior to 
their addition in a second adder (ADD-A), 
a first aligner (AL1) for operand (OPD1) 
alignment interconnected between said 
third input register and a 
true/complement unit (T/C) for operand 
true/complement building, which is con- 
nected to said first adder input register, a 
second aligner (AL2) connected to said 
fourth input register for operand (OPD2)
alignment, and a second normalizer
(NORM-A) connected to said second adder's
output for normalizing the final result,

characterized in that, for performing a fast multiply-add
instruction without requiring a wait
cycle, there is provided:

- a feedback path (ND; Fig. 4) connecting
the output of said first normalizer to the
input of said first and second aligner;

- a first left shifter (SL1) interconnected
between said first aligner and said
true/complement unit;

- a zero setter (Z1) interconnected between
the true/complement unit and said
first adder input register; and

- a second left shifter (SL2) interconnected
between said second aligner and a second
zero setter (Z2) which itself is connected
to said second adder input register.

2. Method of performing a fast multiply-add instruction
without requiring a wait cycle in an
arrangement in a floating point processor, in
particular in accordance with claim 1, characterized
by the following steps:

1. Read the operands (OPDI, OPDII) for
multiplication into first (MA) and second input
register (MB) of multiplier (M);

2. Build the partial sums by the multiplier
(M);

3. Perform the product exponent calculation
and reduce the exponent by 1 if the product
has a leading zero, and at the same time
read operand (OPD1) of the addition;

4. Add the partial sums of the multiplication
and feed the resulting intermediate value
back to the aligners (AL1, AL2) via feedback
path (ND);

5. Compare the exponents of the intermediate
value of the product and the addend
(OPD1) and perform, if they do not match,
a proper alignment;

6. Test whether the following cases apply:

a. If the product is true zero, then the
operand fed back from the multiplier is
forced to zero;

b. If the intermediate product exponent is
smaller than or equal to the addend
operand (OPD1) exponent, then the product
will be aligned;

c. If the intermediate product exponent is
greater than the addend operand exponent,
then the addend will be aligned;

7. If the product does not have a leading
zero, then a potential guard digit of the
product is set to zero;

8. If the product has a leading zero, then
both operands are shifted left by the shifters
(SL1, SL2) by 1 digit and the least significant
digit of the aligner becomes the guard
digit of the result;

9. When both operands are properly
aligned, then they will be added by the
second adder (ADD-A) to the final result of
the fast multiply-add instruction.